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Abstract The goal of variable clustering is to partition a random 
vector X G in sub-groups of similar probabilistic behavior. Popu¬ 
lar methods such as hierarchical clustering or /f-means are algorith¬ 
mic procedures applied to observations on X, while no population 
level target is dehned prior to estimation. We take a different view in 
this paper, where we propose and investigate model based variable 
clustering. We consider three models, of increasing level of complex¬ 
ity, termed generically G-models, with G standing for the partition 
to be estimated. Motivated by the potential lack of identifiability of 
the G-latent models, which are currently used in problems involving 
variable clustering, we introduce two new classes of models, the G- 
exchangeable and the G-block covariance models. We show that both 
classes are identifiable, for any distribution of X. 

Our focus is on clusters that are invariant with respect to unknown 
monotone transformations of the data, and that can be estimated in 
a computationally feasible manner. Both desiderata can be met if 
the clusters correspond to blocks in the copula correlation matrix of 
X, assumed to have a Gaussian copula distribution. This motivates 
the introduction of a new similarity metric for cluster membership, 

CORD, and a homonymous method for cluster estimation. Central to 
our work is the derivation of the minimax value of the CORD cluster 
separation for exact partition recovery. We obtained the surprising 
result that this value is of order ■\/log(p)/n, irrespective of the number 
of clusters, or of the size of the smallest cluster. Our new procedure, 

CORD, available on CRAN, achieves this bound, is easy to implement 
and has computational complexity that is polynomial in p. 

MSC 2010 subject classifications: Variable clustering, latent models, exchangeability, 
block covariance matrix, community estimation, minimax lower bound, consistent parti¬ 
tion estimation,copula models, high dimensional models 
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1. Introduction. The problem of partitioning a random vector X =: 
{Xi, ..., Xp) G MP in groups of variables that are similar in nature is known 
as variable clustering. Solutions to this problem are typically algorithmic 
in nature. Given Xi,..., X„ observations on X, one defines a dissimilarity 
measures between the data vectors of observations on two components Xj 
and Xi of X, and places the two variables in the same cluster if the data 
dissimilarity measure is small. The most popular measure is the Euclidean 
distance between vectors in M"' or, equivalently, the negative sample correla¬ 
tion and is typically the measure employed in popular clustering algorithms 
such as Hierarchical Clustering or ii'-means and their variants. We refer to 
[18] for an overview, and to clusters thus obtained as estimated clusters. 
Since no theoretical cluster target is postulated for X, prior to estimation, 
it is unclear what the clusters obtained in this manner estimate. 

In this work we take a different point of view, model-based clustering. 
We begin by considering models of probabilistic similarity between the com¬ 
ponents of X that will dehne identifiable target clusters, then use the data 
consisting in n observations on X to develop methods tailored to these tar¬ 
gets. We perform a detailed theoretical study of the problem of exact cluster 
recovery. This provides a bridge for the existing gap in the literature on vari¬ 
able clustering. 

We emphasize that our work is not on data clustering, for which a very 
large literature exists, centered around mixture models. We refer to [18] for 
a comprehensive discussion of related methods, and to [14] for foundational 
contributions. In data clustering, clusters are clouds of observations, and 
correspond to respective realizations of one of the mixture distributions, 
which is a distribution on the whole In contrast, in variable clustering, 
the effort is to define cluster models relative to subsets of components Xj of 
X G M^, making it a completely different problem. 

Our contributions and the organization of the paper are described below. 

1: Identifiable models for variable clustering. In Section 2, we introduce 
and study three models that can be used for variable clustering. The param¬ 
eter of interest, in each of these models, will be a partition G = {Gk}i<k<K 
of {1,... ,p}, for some unknown integer iA, and we will therefore refer to 
them, generically, as G-models. We introduce the following partial order on 
sets: 

Partition partial (PP) order: Let G = {Gk}k-, G' = {G'kfik' be two parti¬ 
tions of {1,... ,p}. We say that G' is a sub-partition of G if for each k' there 
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exists k such that G'j^, C Gk- We define the partial order < between two 
partitions G, G' of {1,... ,p} by G < G' if G' is a sub-partition of G. 

We seek a minimal partition (according to the PP order) fulfilling some 
conditions of interest, called model. Since a partially ordered set can have 
multiple minimal elements, our model will be said to be identifiable, if there 
exists a unique minimal partition in the set of partitions following the model. 
Our first contribution is to define models for clustering that are identifiable 
in this sense, without any further distributional assumptions. 

We first consider a model that is already popular for variable clustering, 
the G-latent model, which clusters components Xa of X that have the same 
latent generator . Although not analyzed theoretically in the manner pro¬ 
posed here, the G-latent models are widely used in applications ranging from 
biology to financial data. For instance, for analyses involving fMRI data, we 
refer to [24] and [8], and for a general overview to [25]. We show in Section 
2.5.1 that one can construct vectors X that are latent with respect to two 
different minimal partitions. Thus, the G-latent models are not identifiable, 
in general, without further assumptions. 

To address this issue, we introduce two new classes of models, of increas¬ 
ing complexity. The first is the class of G-exchangeable models, for which 
the within-group variables (Xa)a£Gk exchangeable within X, but the 
between group variables are not. The second, larger, class of models that we 
propose is that of G-block covariance models, which are covariance models 
with a block-like structure: off-diagonal entries within a block are equal. We 
show in Sections 2.2 and 2.3, respectively, that both models are identifiable, 
with the latter being easier to estimate. 

Moreover, in Section 2.5., in which we restrict ourselves to Gaussian 
distributions, we provide a full characterization of identifiability in Gaus¬ 
sian G-latent models. We also provide sufficient conditions under which the 
identifiable partitions for the three G-models under consideration coincide. 

2: Identifiable clusters that are invariant with respect to monotone trans¬ 
formations of the data. Whereas the effort in Sections 2.2 and 2.3 is con¬ 
centrated on defining models for clustering in full generality, care must be 
exercised when using them in real applications, as clusters must be invari¬ 
ant to simple scale or shift transformation of X. Guided by the trade-off 
between distribution complexity and computational feasibility, we focus in 
this work on the class of what we term Gaussian copula G-models, intro¬ 
duced in Section 3. Within this class, we define, more generally, identifiable 
target clusters that are invariant relative to unknown monotone transforma- 
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tions of the data. With a view towards computational feasibility, our focus 
for the remainder of this article is on clusters that correspond to blocks of 
the copula correlation matrix R. To the best of our knowledge, our contri¬ 
butions 1 and 2 are new in the literature. 

3: Cord, a new metric for variable clustering. Central to our work is a 
new, and simple, dissimilarity metric for variable clustering, invariant under 
monotone transformations of the data. Standard clustering techniques em¬ 
ploy the correlation between two variables. In contrast, our metric is based on 
CORrelation Differences (Cord), a measure that attempts to capture how 
two variables behave relative to the rest, rather than relative to each other. 
We discuss this in detail in Section 4, where we also argue that MCord, 
which is a measure of separation between clusters induced by CORD, can 
be used to define distinct clusters, even when the popular cluster separation 
measure, the within-between cluster correlation gap, is negative. 

4: The minimax MCORD lower bound for exact cluster recovery. The 
strength of any variable clustering procedure is given by its ability to recover 
minimally separated clusters. Therefore, prior to developing an estimation 
procedure, it is crucial to assess the minimal size of the cluster separation 
below which no algorithm can be expected to recover consistently the correct 
partition. In Section 5 we present the main theoretical result of our work, 
the minimax MCORD separation rate for exact cluster recovery. 

Minimax-type results are sometimes viewed as being overly pessimistic, 
since they are typically derived by considering very special distributions, cor¬ 
responding to the worst case scenario for the problem under consideration. 
From this perspective our results are non-standard. Let m =: rnini<k<K\Gk\ 
denote the minimum cluster size. Theorem 2 shows that, for variable clus¬ 
tering, the minimax separation rate under the MCord metric is Y^log(p)/n, 
in a variety of cases, summarized below, irrespective of the size of K and m, 
or of the size of the correlation between variables in distinct groups: 

• A case that is believed to be the most difficult for any clustering algorithm, 
corresponding to K = pl‘1 clusters of size m = 2 each. This is Proposition 
8, in Appendix A.2. 

• A case with in which we have LL = 2, m = 2 and consequently the other 
cluster has size p — 2. This is Proposition 7, in Appendix A.2. This shows 
that having fewer clusters does not simplify the problem. 

• A surprising case, with K = 3 and m = p/3, presented in Proposition 9 in 
Appendix A.2. This shows that the separation rate, under MCORD, is not 
influenced by m or K, as it is still ^/log{p)/n, even though we have very 
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few, K = 3, clusters, each of large size, p/3. 

• Moreover, our examples show that the same rate holds when variables in 
distinct groups are independent or when they are strongly correlated. 

These results are, to the best of our knowledge, the first of their kind in the 
literature on clustering. 

5: A new variable clustering algorithm, Cord. In Section 6 we develop 
Cord, a new clustering algorithm. It has moderate computational complex¬ 
ity, polynomial in p, and it is already freely available on-line, on CRAN. We 
prove that CORD can recover correctly the target partition, with high prob¬ 
ability, at the MCord minimax cluster separation rate. Section 7 contains 
summaries of our extensive numerical studies that show that CORD recovers 
clusters defined by our G-models, while existing variable clustering algo¬ 
rithms, such as iC-means or hierarchical clustering based on the Euclidean 
distance, typically fail to do so. Section 7.4 provides further numerical evi¬ 
dence on the qualities of the newly introduced CORD metric as a bona-fide 
dissimilarity measure for variable clustering. We conclude with two real data 
examples, one of which is presented in the supplemental material. 

1.1. Contrast with clustering from network data. The problem of grouping 
p entities in communities is also encountered in network analysis, and the 
estimation of these communities is known as the problem of community de¬ 
tection. However, the nature of the data is different. One observes a network 
of p objects, and associates with it a p x p adjacency matrix N: if there is a 
link between i and j, Nij = 1, whereas the absence of a link yields Nij = 0. 
The observed entries of N are assumed to be realizations of mutually in¬ 
dependent Bernoulli random variables. One of the most basic models used 
to define communities in this context is the Stochastic Block Model (SBM), 
dating back to [15]. The model has been extensively studied in the past few 
years, and we give here only an incomplete list of references, most closely re¬ 
lated to our work: [2], [21], [6], [11], [1], [20]. SBM assumes that E(A') has a 
block structure. In our work, we will employ the G-block correlation model, 
that postulates that the off-diagonal entries within a block of a correlation 
matrix R are equal. Since both problems involve block matrices, they may 
appear similar. However, there are fundamental differences between cluster¬ 
ing based on SBM and clustering based on G-block correlation models, and 
we outline some of them below. 
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Data has different nature. The data generated from a SBM is a p x p bi¬ 
nary matrix as above. The data generated from a G-model consists in n 
observations of a p dimensional vector, and can be organized as an n x p 
matrix of real numbers, with independent rows. We highlight two important 
repercussions of this difference. 

Assume that in either SBM or the G-block correlation model the number 
of clusters K remains constant, as p increases. Assume also that n is fixed in 
the context of variable clustering, when p increases. Clearly, the clustering 
task in SBM becomes easier as p grows, since we have more data, whereas in 
G-models it deteriorates, as the quality of any estimator R oi R deteriorates 
when p grows but n is set to a given value. Thus, while clustering a large 
network is statistically easier than clustering a small one, the opposite is true 
for variable clustering: clustering a high dimensional vector is harder than 
clustering a low dimensional vector, from the same amount n of observations. 

Since N has independent entries and R has correlated (possibly strongly) 
entries, the statistical properties of the error terms involved in each prob¬ 
lem, Vi =: R — E(i?) and V 2 =: N — E(A^), respectively, are very different. 
In SBM, cluster recovery typically requires control of the operator norm of 
V 2 , see for instance [21]. We will show in Section 6 that, fortunately, for 
G-models one can construct consistent cluster estimators that only require 
control of the £00 norm of Vi, which is of order y^log(p)/re, and not of the 
operator norm of Vi, which is of the order of y^||i?||optracei?Y^log(pn)/n, 
see for instance [5]. Using the later may unnecessarily inflate the size of the 
minimal cluster separation needed for exact recovery. 

Different cluster separation metric. Most SBM analyses focus on the case 
where there is a positive gap for the within-between community probabilities 
and investigate the minimal size of this gap for consistent clustering. As 
explained in the previous section, the quantity of interest for cluster recovery 
in G-models is not the correlation gap, but instead the newly introduced 
MCord separation. One of our main contributions is the investigation of the 
minimal MCORD separation needed for successful clustering in G-models, 
which is our newly proposed metric for cluster separation. To the best of 
our knowledge, this question has not been posed in the SBM literature, and 
therefore not answered elsewhere. 

2. Three models for variable clustering. Let X G be a zero mean 
random vector. Let G = {Gk}k=i,...,K be a partition of {!,... ,p}, for some 
integer K, 1 < K < p. In what follows we introduce and compare three 
different models for the distribution of X that may potentially lead to dif- 
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ferent clusters Gk and different final partitions G. We begin by discussing 
partition identifiability. 

2.1. G-latent models. Perhaps the most popular model connected to vari¬ 
able clustering, via the literature on clustering algorithms, is the simplest 
form of a latent variable model, as defined below. We will refer to this model 
as the G-latent model. Let G be a partition of {1,... ,p} into K groups and 
let k : {1,... ,p} —)■ {1,... ,K} be an index assignment function defined by 
Gfc = {a : k{a) = k}. 

Definition 1. The G-latent model. The zero mean random vector X 
follows a G-latent model if there exists a zero mean random vector Z = 
{Zi,..., Zk) C such that Xa = Zj.(^a) + -®aj 1 < a < P, where the 
mutually independent zero mean error terms Ea have variance 'yk{a)j 
are independent of the latent variables Z. If a is a singleton, we use the 
convention ^k{a) = 0- 

Let C[X) = {G : X is G-latent} . The set C(X) is nonempty and finite, 
so it does have minimal elements G^. We will argue in Section 2.5.1 that, 
however, the minimal elements are not necessarily unique. Thus, without 
further assumptions, the G-latent models are not identifiable. This motivates 
the introduction and study of other classes of models for variable clustering, 
that are more flexible and are identifiable in full generality. We return to the 
class of G-latent models and offer sufficient conditions for their identifiability 
in Section 2.5.2. 

2.2. G-exchangeable models. To introduce a new class of models for clus¬ 
tering, we will use the following intuition: two variables that belong to what 
will constitute a group in the partition should have similar behavior relative 
to all the other variables. One way to characterize this behavior is in terms 
of the joint distribution of X = (Xi,... ,Xp). Below we will define parti¬ 
tions with the property that, in particular, two variables within the same 
group can be switched without changing the distribution of the vector, while 
switching variables between groups will change this distribution. 

Formally, for a partition G, we define below the set Sq of permutations a 
of {1,... ,p} that only permute elements within each group of the partition, 
but not between groups. Let S{Gk) be the set of permutations ak of Gfc. 
Thus, Sg = {cr = ai.. .ax ■ CTk ^ <S{Gk)} • We let X^ := (X^(i),... ,X^(p)). 

Definition 2. The G-exchangeable model. The distribution of X is 
G-exchangeable (denoted byXr^G) ifX.„ X, for all a G Sq. 
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It is immediate to see that if X is G-latent, then X is G-exchangeable, but 
the reverse is not necessarily true, as the following example shows. 


Example 1. Let ^ be a zero mean Gaussian vector with covariance matrix 


1 - 1/2 0 0 

- 1/2 1 0 0 

0 0 1 - 1/2 

0 0 - 1/2 1 


Then JL is G =: {{1,2}] {2, 3}}-exchangeable, but not G-latent. To see why 
notice that S 12 and S 34 are negative, and, for instance, S 12 should have 
been positive, as it would be equal to var(Zi) if X. were G-latent. Observe, 
moreover, that in this example the within-group correlations, which are all 
equal to —1/2, are smaller than between group correlations, which are all 
equal to zero. 


2.2.1. G-exchangeable models are identifiable. Let £^(X) = {G : X rsj G}. 
Observe that the set S{X) is not empty, as X is always G-exchangeable ac¬ 
cording to the largest possible partition with p groups, G = {{1} ,..., {p}}. 
Theorem 1 below shows that <S(X) has a unique minimum, for a generic 
random vector X, without any further distributional assumptions. We give 
a constructive proof of this result, that necessitates the definition of the 
minimum between two partitions. For any partition G = {G}f}i<k<K^ we 

write a ~ 6 if there exists k G {1,..., K} such that a,b £ G^. If for some a 

Q 

there does not exist b a such a ^ b, we call a a singleton. 


Definition 3. For any two partitions G,G' of {1,... ,p}, we define GaG' 

GAG' 

as the partition induced by the equivalenee relation a ^ b iff there exist 
d £ N, and ci,... ,Cd £ 11 ,...,p| such that a ~ ci ~ ... ~ 

b. Informally, a b iff there exists a path through groups of G 
and G' which connects a to b. 


It is straightforward to check that ~ is indeed reflexive, transitive and 
symmetric. With this definition, we have the following interesting properties 
that lead to the desired result, the identifiability of the newly introduced G- 
exchangeable models. 


Theorem 1. Let X be a zero mean random vector with arbitrary distribu¬ 
tion. 
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1. Consider G < G'. Then X ^ G implies X ^ G'. 

2. G AG' <G and G AG' <G' 

3. IfXr^ G andXr^ G', then X ~ G A G'. 

4- The set £{X) has a unique minimum, G^{X), with respect to the PP 
order. 

We refer to Appendix A.l for a proof of this theorem. It shows that the 
minimum partition for which X is G-exchangeable always exists and its 
structure is intrinsic to the distribution of X. The minimal partition G^(X) 
matches our intuition regarding what would constitute a reasonable clus¬ 
tering target. However, from a statistical point of view, estimating G^{X) 
may be challenging in high dimensions, as it will typically require estimation 
of high dimensional distributions and multiple testing. This motivates the 
introduction of a more general model, described in the following section. 

2.3. G-block covariance models. Clustering according to G-exchangeable 
models may be too stringent, as it imposes conditions on the whole dis¬ 
tribution of X. In our third model, proposed below, we suggest clustering 
according to functionals of the distribution. In particular, we will consider 
the group structure on the covariance matrix induced by G-exchangeability. 

Definition 4. G-block structure. Let G be a partition of {1,... ,p}. 
The covariance matrix Yi of X ^W’ is said to have a G-block structure if 

Q 

• var{Xa) =: T^aa = '^bb := var{Xb), if a b 

G G 

• cov{Xa,Xa') =: T,aa' = ^bb' ■= cov{Xb,Xb>), for any a b, a' b' 
and a ^ a', b ^ b'. 

A p X p covariance matrix with K blocks has a reduced number of distinct 
parameters, K{K — l)/2 -|- K, as formalized in the result below. We write 
|Gfc| for the cardinality of Gk- With this notation, we list below a number of 
elementary, but very useful, properties of a covariance matrix with a G-block 
structure, proved in Section 1.1 of the supplemental material [4]. 

Lemma 1. A covariance matrix S = cov{X) with a G-block structure has 
the following properties: 

1. There exist K{K — l)/2 real numbers c^k' = Ck'k; 1 < k,k' < K, such 
that Tiab = CkkA for all a b with k{a) = k and k{b) = k'. 

2. There exist K non-negative numbers dk such that, for all a, T^aa = dfc- 

3. For all 1 < k < K, we have — < Ckk < dk. 
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2.3.1. The G-block covariance models are identifiable. For a vector X G 
with covariance matrix S let B{X) = {G : S has G-block structure}. We 
show that G-block covariance models are identifiable by constructing the 
unique minimum element of B{X). 

Definition 5. Let G^^°^^{X) he the partition induced by the equivalence 

Qhlock 

relation: a ^ b iff var{Xa) = var{Xi,) and cov{Xa, Xc) = cov(Xf,, Xc) 
for all cfi^ a,h. 

It is immediate to verify that this relationship is reflexive, symmetric and 
transitive. 

Proposition 1. The partition G^^°^^{X) is the minimum of B{X). 

We refer to Section 1.1 of the supplemental material [4] for a proof. Propo¬ 
sition 1 shows that, similarly to the G-exchangeable models, the class of the 
G-block covariance models is identifiable, for general distributions. Since 
£(X) C B{X), we always have G'^(X) > and the two partitions 

may not coincide, in general. However, there are situations where they do, 
and we discuss them in Section 2.5 below. 

2.4. Connections between the three G-models. We have the following im¬ 
mediate relations between the three models introduced above. We omit the 
proof. 

Proposition 2. Let X he a zero mean random vector in with covari¬ 
ance matrix S. For any partition G of {1,... ,p}, we have: 

X is G-latent ^ X is G-exchangeable => S has G-block structure. 

We show below that the three models are not equivalent. To see why, we 
first associate to a p x p G-block covariance matrix a K x K matrix G 
defined below, using the notation introduced in Lemma 1. 

Definition 6. The matrix of covariances G. Let Yi be a G-hlock co- 
variance matrix. Let G he defined by 

• Cfcfc/ = Yab, for all a b with k{a) = k and k{b) = k'. 

• if a is a singleton, with k{a) = k, we set Ckk = T‘aa = dk. 

The matrix G = {ckk'}i<k,k'<K is well-defined for any k,k' and it is sym¬ 
metric. By Lemma 1 we always have dk > Ckk, with equality for singletons. 
The matrix G collects the within and between group covariances and is not 
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always semi-positive definite. For instance, the matrix C associated with 
Example 1 is 


C 


- 1/2 0 
0 - 1/2 


which is negative definite. There are however situations where C is positive 
semi-definite, as the following proposition shows. Its proof is immediate and 
therefore omitted. 


Proposition 3. /f X follows a G-latent model, then S has G-block struc¬ 
ture with associated matrix G = cov{Z). Therefore, in this case G is a 
semi-positive definite matrix and 7fc(a) = <^fc(a) “ (^k(a)k{a) ^ 0, for all a. 

Remark 1. The three G models are not equivalent: Proposition 3 shows 
that when S has a G-block structure, if the matrix G given by Definition 6 is 
not positive semi-definite, then X cannot admit a latent representation with 
the same G. 


2.5. Gaussian G-models. The theoretical results of the previous sections 
hold in full generality. More connections between the three G-models can 
be established if one makes distributional assumptions. In this section we 
study vectors X with a Gaussian distribution. In this case, we give a full 
characterization of identifiability in G-latent variable models, and connect 
this back with G-block covariance models and G-exchangeable models. 

2.5.1. The G-latent models may not be identifiable. In this section we re¬ 
visit the G-latent models and construct a counter-example that shows that 
the same vector X has two different latent representations, with respect 
to two different minimal partitions. In our construction, we will use the 
following simple lemma, proved in Section 1.1.3 of the supplement [4]. 

Lemma 2. /f X has a mean zero Gaussian distribution with covariance 
matrix S that is G-block structured, and if the matrix C given by Definition 6 
is positive semi-definite, then X follows a G-latent model. 


Example 2. Consider the positive definite matrix 

1 0.25 0.26 0.26' 

0.25 1 0.26 0.26 

0.26 0.26 1 0.25 

0.26 0.26 0.25 1 
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and'K with Gaussian Af{0,T,) distribution. We have {X) = {{1, 2} ; {3,4}}. 
We use Definition 6 to construct the C-matrix associated to S relative to 

/^block 





0.25 0.26 
0.26 0.25 


which is not a positive semi-definite matrix. Therefore, by Proposition 3, X 
cannot follow a {X)-latent model. Consider now G = {{1} ; {2} ; {3,4}}. 
The C-matrix associated to S relative to G is 


C{G) 


1 

0.25 

0.26 

0.25 

1 

0.26 

0.26 

0.26 

0.25 


which is positive semi-definite. Therefore, by (i) of Lemma 2, X follows 
a G-latent model. By symmetry, X also follows a G'-latent model with 
G' = {{1, 2} ; {3} ; {4}}. The two partitions G and G' are both minimal 
for the latent property with respect to the partial order PP and C{X) = 
{P:G<PorG'< P}. 

This example suggests that either X is G^^°‘^^(X)-latent or its latent struc¬ 
ture will contain singletons. This is indeed true, as formalized below. 


Proposition 4. Assume X is a mean zero Gaussian vector. Let G^ be a 
minimal element in C{X). Assume that X is not G^^°^^{X)-latent. Then G^ 
has at least one singleton. 


We refer to Appendix A.4 for a proof. A consequence of Proposition 4 is: If 
a minimal partition G^ in C{X) has no singletons, then G^ = G^^°^^{X) and 
it is the unique minimum of C{X). 

2.5.2. Identifiable Gaussian G-models. The following proposition summa¬ 
rizes the connection between Gaussian G-latent, G-exchangeable and G- 
block covariance models, and highlights a sufficient condition under which 
the minimum latent partition of X is G^^°^^{X), henceforth ensuring the 
uniqueness of the latent structure. We refer to Section 1.1.4 of the supple¬ 
ment [4] for a proof. 

Proposition 5. Let X be a zero mean Gaussian distribution with covari¬ 
ance matrix S. Let G be the matrix associated with S and a partition G via 
Definition 6. Then: (i) For any partition G of {1,... ,p} we have that 
X is G-exchangeable S has a G-block structure. Thus, G^(X) = G^^“^(A). 
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(a) The following chain of equivalences holds: X is G-exchangeable and C > 
0 S has a G-bloek structure and C > 0 X. is G-latent. 

(Hi) If the matrix of covariances C corresponding to the partition G^^°^^{X) 
is positive-semidefinite, then X is -latent, and this is the unique 

minimal partition aceording to which X admits a latent decomposition. 

3. Scale and translation invariant clnstering via Ganssian copula 
G-models. If all the components of X have different variances, none of 
the proposed G-models would hold, despite possibly very strong reasons for 
grouping these components. Clearly, one should first make all components 
of X comparable at least in terms of scale and location, and postulate a 
model for clustering for a transformation of X, not necessarily X itself. 

Formally, we assume that there exists an unknown transformation h of X 
such that Y =: /i(X) follows a G model. In Section 2 above we showed that 
the partition G is identifiable, under no further assumptions, if Y follows ei¬ 
ther a (a) G-exchangeable or a (b) G-block covariance model for clustering. 
In full generality, if we choose model (a), the estimation problem is diffi¬ 
cult and possibly NP-hard, as it will involve estimation of high dimensional 
density functions and multiple testing in high dimensions. Moreover, not 
knowing the transformation h adds another layer of difficulty to an already 
complex problem. 

We propose therefore to solve the problem for a particular class of distri¬ 
butions for X, that of Gaussian copula distributions. We assume for the rest 
of this section and paper that X has a Gaussian copula distribution with 
zero mean, and copula function with parameters p, = 0 and R, a correlation 
matrix, and we write X = C{R). Recall that this implies that 

(1) Y := (Yi,..., Yp) := (hi(Xi),..., hp{Xp)) =: h(X) ^ AAp(0, R), 

with ha '■= o Fa, for each a, where ‘k is the c.d.f of a standard Gaussian 
random variable. Fa is the marginal c.d.f. of Xa and Var(Ya) = 1, for all a. 

Definition 7. Assume that X has a Gaussian copula distribution with zero 
mean, and copula function with parameters p = 0 and R. If the Gaussian 
vector Y given by (1) follows a G-model, we say that X follows a Gaussian 
copula G-model. 

To emphasize that the covariance matrix of Y is a correlation matrix, 
whenever Y follow a G-block covariance model, we will call it G-block cor¬ 
relation model. 


imsart-aos ver. 2014/10/16 file: C0RD-REVISI0N.Aprill.2016.tex date: April 1, 2016 


14 


F. BUNEA, C. GIRAUD AND X. LUO 


By Proposition 5 (i), the G-exchangeable and the G-block covariance 
models are equivalent for Gaussian vectors. Thus, either model assump¬ 
tion on Y will lead to the same identifiable target partition. Moreover, by 
Proposition 5 (iii), if the matrix C that can be associated with the correla¬ 
tion matrix R via Definition 6 is non-negative definite, of R will also 

be the partition according to which Y has a latent decomposition. 

Therefore, if X follows a Gaussian copula G-model and C is non-negative 
definite, we have the same target partition, irrespective of the particular type 
of G-model we postulate. Thus, if one prefers either the G-exchangeable or 
the G-latent model for their interpretability, one can use G-block correlation 
models for estimation, if it turns out that estimation is computationally 
efficient in that model. As we show in Section 6 below, this is indeed the 
case. We therefore focus for the rest of the paper on clustering via G-block 
correlation models, relative to the copula correlation matrix R. 

4. The CORD similarity metric and its induced cluster separation 
metric. Classical clustering methods, such as iP-means and hierarchical 
clustering, typically build clusters for which the within-group correlation 
is larger than that between groups. Therefore, if R has a block structure 
with associated matrix of correlations C, and Cjk denotes the correlation 
between variables in group Gj and those in Gk, one expects that consistent 
clustering would be possible with these classical clustering algorithms when 
the within-between correlation gap 

(2) min(cjy - Cjk) 

is large enough. However, as our Example 1 shows, this requirement can be 
easily violated, and the gap (2) can be negative. 

If clustering is done according to G-models, we argue below that the within- 
between correlation gap (2) is no longer the quantity that governs cluster 
separation. We dehne a new cluster separation metric that is positive, and 
can be large even if (2) is negative, thus offering an alternative to classical 
clustering. 

To motivate our metric, notice that in G-block correlation models, if two 
variables with indices a and b are in the same group, then Rac = Rbc, for 
any c a,b. Conversely, if a and b are not in the same group, then Rac 7 ^ Rbc 
for at least one variable c a,b. Therefore, two variables Xa and belong 
to the same cluster defined by the partition of R if and only if 

max \Rac - Rbc\ = 0. 

c^a^b 
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This motivates the introduction of a new disimilarity metric for variable 
clustering 

Cord (a, b) := max \Rac — Rbc\, 

c^a^b 

which is tailored to G-models. The key quantity that quantifies the difficulty 
of clustering in G-models is the minimal CORD separation between two 
variables in distinct groups: 


(3) MCord(R) = min CORD(a, 6). 

riblock 

a no h 

Then, the larger the value of MCORD(i?), the easier the cluster detection 
problem. We emphasize that requiring MCORD(i?) be larger than rj, for 
some ?7 > 0, is in general much less stringent than requiring that the within- 
between correlation gap (2) be larger than ry. To see why, notice that when 
the partition has no singletons, we always have 

MCORD(i?) > min(c,o — Cjk)- 

Referring again to Example 1, it may happen that the right-hand side is 
negative, while the left-hand side is large. 

In Section 5 below we study the minimax optimal separation size for consis¬ 
tent partition estimation with respect to the newly introduced MCORD(ii). 
The investigation of the minimax optimal correlation gap (2) ensuring con¬ 
sistent partition estimation, is considered in [3]. 

5. The minimax lower bound for cluster separation with respect 
to MCord(i?). We always have MCORD(ii) > r], with ry = 0. However, 
a larger value of rj will be needed for retrieving consistently the partition 
Qbiock fj.Qm noisy observations. We give its minimax optimal value below. 
To begin with, we define a general class of models over which we establish 
our minimax lower bounds. Let R be the copula correlation matrix of X, 
assumed to have an arbitrary Q^^°^k structure. We define 

7^(ry) = {R : MCord(R) > ?y}. 

Our goal is to find the minimal value ry* below which exact recovery with 
high-probability is impossible. Formally, we will show that 

inf sup Pj?(G / Qbiock'^ y constant > 0, 

G R&n{r,) 

for all 0 < ry < ry*, where the infimum is taken over all possible estimators. 
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Theorem 2. 


we have 


0 - 6 a/^ 

For any 0 < tj < rj* := - , 

1 + 0 . 6 ^/^^ 


inf sup Pij(G / > 

G Re7^(r?) 


1 ^ 1 
2e + 1 “ 7 ’ 


where the infimum is taken over all possible estimators. 


We observe that V* - 0.6^^ when log(p) = o(n), as announced in the 
discussion in the Introduction. The proof is given in the Appendix A.2 and 
consists in the collection of the proofs of Propositions 7, 8 and 9, each show¬ 
casing a different scenario for the number of clusters, the size of the smallest 
cluster and the correlation between and within clusters. In all these cases, 
the minimax cluster separation rate is Y^log(p)/n, a new and surprising 
result in the literature on variable clustering. 


6. Consistent estimation of minimally separated clnsters via Cord. 

In this section with present an algorithm that achieves the separation lower 
bound, for Gaussian copula G-block correlation models. We recall that we 
place variables Xa and X^, in the same group when their transforms Ya and Yf, 
are in the same group, and the latter happens if and only if CORD(a, b) := 
maxc/a,6 \Rac — Rbc\ = 0) where R is the copula correlation matrix of X. 
Since CORD does not use the size of the correlation of a pair of variables 
to decide on group membership, estimation of clusters in G-block models 
cannot be performed in the standard manner of thresholding an estimator 
of R, followed by taking the connected components of the resulting graph. 
Instead, our estimator will be based on thresholding an estimator of CORD, 
in an iterative fashion, as we explain below. To estimate CORD, we only 
require estimates of the entries of R from n observed independent copies 
Xi,..., X„ of X. Let R be an estimator of R. We first recall that, under the 
Gaussian copula model on X, 

TT 

Rab = Cov{Ya,Yb) = Corr{Ya,Yb) = Sm{-Tab), 
where Kendall’s tau coefficient is defined by 




sign((ya 


Ya){Yb-Yb)) 


E 


sign((Xa 


Xa){Xb-Xb)) , 
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with Xa {Ya) and Xf, (Yb) independent copies of Xa (Ya) and Xi, (Yb), re¬ 
spectively. Therefore, we can estimate Rat by 


( 4 ) 




Rab = sin(-fafe), 


where 


Taft — 


2 

n{n — 1 ) 


l<^<2^<n 


Xf)sign(X^-Xf). 


If we have reasons to believe that X has a Gaussian distribution, then we can 
alternatively consider as an estimator of R the sample correlation matrix. 
Given the entries of R, the goal is to estimate the minimal partition 

with respect to which R has a block structure. To a given estimator R of R 
we associate the estimator 


( 5 ) 


CORD(a, 6) := max\Rac - Rbc\, a,b e ,p} , 

c^a,b 


of the Cord metric. We then estimate the partition G according to the 
following algorithm. 

Algorithm 1 

• Input; R and a > 0 __ 

• Initialization; S = p} and CORD(a, 6) = max |i?ac — .Rfed 

c^a^b 

and I = 0 

• Repeat; while 5/0 

- l^l+l 

- If |5| = 1 Then Gi = S 

- If |5| > 1 Then 

* {ai,bi) = argmin CORD(a, 6) 

a^b^S, a^b 

* If CORD(aj, bi) > a Then Gi = {ai} 

* If CORD(aj,6;) < a Then 

G; = |s G 5 ; CoRD(a;, s) A Cord(6j, s) < a| 

- S\di 

• Output; the partition G = (Gd;=i^...^fc 
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We emphasize that this algorithm does not require as input the specification 
of the number K of groups. The algorithmic complexity for computing R 
defined by (4) is 0{p^n\og{n)) (see [7]) and the complexity of Algorithm 1 
is 0{p^), so the overall complexity of our estimation procedure is 0(jP‘{p\/ 
n log (re)). In the following, we provide conditions ensuring that G = . 

Proposition 6. Let 'X. be a zero mean random vector with a Gaussian 
copula distribution with parameter R. Let R be any estimator of R. We 
define t = \R — i?|oo and we consider two parameters {a,r]) fulfilling 

(6) a > 2r and rj > 2t + a. 

Then, if R € R-ip), our algorithm yields G = . 


We refer to Appendix A.3 for a proof of this proposition and to Section 1.2 
of the supplemental material [4] for the proof of the following corollary. 

Corollary 1. Let us consider {a,p) fulfilling 

/(! +A) log(p) /(I + A) log(p) 

a > 4711/- and p > Att\ -h a, 

V re V re 

for some A > 0. //X has a zero mean Gaussian copula distribution and 
R G R-ip), then the output of our algorithm applied to the estimator R given 
by (4), is consistent: G = ^ with probability higher than 1 —p~^^. 


Remark 1. Therefore, with r < constant x y ^ thresholding CORD 
at level 2r guarantees exact recovery, whenever the CORD separation p is 
at least 4t. Whereas the exact values of the constants will depend on the 
particular parameters of the distribution. Theorem 2 in the previous section 
guarantees that the order of the CORD separation is tight, since we showed 
that there exists constant* such that if the Cord separation equals p* = 

constant* x exact recovery is impossible, by any method. 

If X has a Gaussian distribution, we can use the sample correlation 
matrix for R. The results of Corollary 1 above remain unchanged, modulo 
constants, and are presented in Corollary 1 of Section 1.3 of the supplement 

[ 4 ]. 


7. Simulations. 
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7.1. Simulation design. In this section we verify numerically our theoreti¬ 
cal findings. We consider a number of types of G-block covariance matrices, 
of increasing level of complexity. To construct them, we recall Lemma 1 
of Section 3.1, that shows that the G-block structure for S is specified by 
C = {ckk') and D = (dfc), 1 < k,k' < K. We note that once S is constructed 
for our simulation models, the correlation matrix R has the same G-block 
structure. 

We consider the following models for S corresponding to the following matrix 
C: 

• Model 1: C is block diagonal and each block is a 2 x 2 block matrix 
B, where Bn = 0.6, B 22 = 2, B 12 = B 21 = 0.8. 

• Model 2: C = B"^B where B is a random K x K matrix with in¬ 

dependent entries. Each entry takes the value -|-1 and —1 with equal 
probability 0.5 x and the value 0 with probability 1 — 

• Model 3: C = [B — B^ ~ where B is randomly generated as 
in Model 2 and B^k' = K~^ Ylk=i ^kk' ■ 

The matrix C is positive definite in Models 1 and 2, and it is positive semidef- 
inite in Model 3. We have also considered negative definite matrices C by 
subtracting a small positive number from all the diagonal entries of the C in 
Model 3, and the results are similar. In all these models, we set dk = Ckk +1; 
k = I,... ,iL, to ensure that the resulting S (and R) is positive definite. 
We set iC = 10 to be the the number of equal-size groups of variables, for 
each of two representative cases p = 200 and p = 1600. For each p, we run 
simulations for samples of sizes n = 100, 200,300,..., 1000. All simulations 
are repeated 100 times. 

We simulate the data from either a Gaussian or Gaussian copula distribu¬ 
tion. In the Gaussian setting, the observations on X are simulated from a 
J\f (0, R) distribution, with R having the block structure given by Models 
1 - 3. To simulate Gaussian copula observations, we apply transformation 
/ {x\vk) to each variable Xa if o G for X simulated from AA(0,i?). The 
transformation functions take the form 

/ {x\vk) = log (^1 -L |x - • sign(x - Vk) 

where the parameters Vk, k = 1,... ,K, are iid from a uniform distribution 
between (—1,-|-1). The distribution of each variable after transformation 
is bimodal. We obtained very similar results in the Gaussian case and the 
Gaussian copula case. For brevity, we only report the result for the Gaussian 
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copula case in this section and we refer to Section 2 of the supplemental 
material [4] for the results in the Gaussian case. 

7.2. G-Block Estimation. The goal of our algorithm is to create sub-groups 
of vectors of dimension n, from a given collection of p vectors of observations, 
each of dimension n, according to CORD. This task can be viewed as that of 
clustering p objects in M". The existing clustering algorithms are not tailored 
to recovering groups with this structure, but they can serve as comparative 
methods. We thus compare the performance of CORD with two popular 
clustering algorithms: K-means and Hierarchical Clustering (HC). We apply 
K-means on the columns of the n x p matrix of scaled observations, and we 
use the negative correlation as distance matrix in HC. Both Kmeans and HC 
require specification of the number of groups K. We use the true iA = 10 in 
these methods to evaluate their oracle performance. In our simulations, both 
Cord and HC use estimates of Pearson’s correlation under the Gaussian 
setting, and of the sinusoid transform of Kendall’s tau in (4) under the 
Gaussian copula setting. In the following simulation study, we use the hxed 
threshold a = 2^/log{p)/n, since our theoretical results suggest the usage 
of a threshold proportional to Y^log(p)/n with a constant larger than 1. We 
also provide a data-dependent choice in the next section. We consider the 
set-up described above, with n = 100, 200 ,..., 1000 and p = 200,1600. 

7.2.1. Exact recovery. We first compare the performance CORD and its 
competitors in terms of exact true group recovery. Figure 1 shows the per¬ 
centages of exact recovery by K-means, HC, and Cord. Cord clearly out¬ 
performs both K-means and HC when n is about 400 or larger in all the mod¬ 
els. K-means and HC, even with the oracle choice of K and large n = 1000, 
fail to recover the true groups exactly. It is also interesting to note that their 
recovery percentages are flat around 0 as n increases, and the percentages 
for HC decrease for small n in Model 2 and 3. 

We proved in Section 6 that, whenever the group separation parameter r] 
and the threshold level a are both larger than appropriate multiples of 
log^'^^p. Algorithm 1 recovers consistently the block structure of R. 
Figure 1 shows that CORD recovers the true groups almost 100% of the 
runs when the condition r] > a + 2 t = 2n“^/^ log^^^p -|- 2r is satisfied. 
This condition translates into saying that the sample size is larger than a 
minimal sample size, shown as a vertical line in our plots. The recovery 
percentages for Cord show sharp rise around the calculated lower bounds, 
and the curves for p = 200 and p = 1600 almost overlap. This confirms 
empirically the phase transition by the scaling rate log^'^^p, observed 

in our converge rates in Corollary 1 and Theorem 2. 
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Figure 1: Percentages of exact recovery by K-means (K-Oracle, medium 
blue lines, triangle points), HC (H-Oracle, light yellow lines, cross points) 
and Cord (dark red lines, circle points) across 100 runs, whenp = 200 (solid 
lines) and p = 1600 (dashed lines). The minimal sample sizes are shown by 
light red vertical lines. All standard errors are smaller than 5%. 


Semi-parametric: Gaussian Copula 

(a) Model 1 (b) Model 2 (c) Model 3 





7.2.2. Approximate recovery. Perfect recovery is a strong requirement for 
any partition estimation algorithm, so we also investigate the performance 
of Cord compared to K-means and HC in terms of partial recovery. We 
measure the partial recovery performance in terms of the Adjusted Rand 
Index (ARI) [17], which is a continuous metric with values in [—1,+1] for 
comparing two partitions of a set. Two identical partitions yield an ARI 
value of 1, and random partitions are expected to yield a zero value. Figure 
2 shows that, in terms of partial recovery, our method is again a strong 
competitor of existing clustering algorithms in any regime: the ARI of CORD, 
a method that also estimates the number of communities, grows steeply 
in the small n regime, and becomes 1 fast, whereas the other algorithms, 
which require the number of groups as an input, level off at an ARI of 
approximately 0.8, even when the sample size increases substantially. 

7.3. Performance of Cord with data-driven a. 

7.3.1. Cross validation for the CORD threshold. The threshold a appearing 
in Corollary 1 does not involve any unknown quantity and they can be used 
directly. In practice, however, it is advisable to use a data-driven choice of the 
threshold for CORD. We propose to use the following type of cross-validation 
for this purpose. The idea is to construct a loss function over a grid of the 
a values, such that the value of a for which this loss has minimum value is 
also the value of a for which we have consistent recovery of our communities. 
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Figure 2: Average ARI by K-means (K-Oracle, medium blue lines, triangle 
points), HC (H-Oracle, light yellow lines, cross points) and Cord (dark red 
lines, circle points) across 100 runs. 


Semi-parametric: Gaussian copula 

(a) Model 1 (b) Model 2 (c) Model 3 





To this end, it is useful to introduce the following operator on correlation 
matrices. 

Given a correlation matrix R and a partition G, we introduce a block av¬ 
eraging operator T {R, G) which produces a G-block structured matrix of 
the same size as R. For any a G Gu and b G Gk', the output matrix entry 
G)]ab is given by 

\Gk\-^ (IGfel - l)-i Rij if a / 6 and fc = fe' 

\Gk\~^ \Gk'\~^ Rij if a / 6 and fc / fe' 

1 if a = 6. 

In essence, this operator averages over the submatrix of R with indices in 
Gk and Gk' respectively, except for the diagonal entries. It produces an 
estimator of R that has the G block structure. 

Given a matrix loss function L, and an estimator of R, our cross-validation 
(GV) procedure is as follows: 

For a sample of size n on X, we hrst randomly split it into two equal-size 
datasets: the training dataset and the validating dataset. We then compute 
the corresponding correlation estimates Rt and with indices referring to 
the training and validating datasets, respectively. 

For given D > 1, and each value ai, I = 1,D, on a grid, we use Algorithm 
1 to compute Gi, using Rt and ai as input. Our data dependent threshold 
is the grid value d given by 

(7) d = argminL , 


[riR,G)U={ 
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Table 1 

Average a values selected by our cross validation approach and the clustering 
performance of the cross validated a measured by the exact recovery percentages (ERP) 
and average ARI, under either the Gaussian (G) or Gaussian copula (GG) setting. All 
standard errors are smaller than 0.06. 


Model 1 Model 2 Model 3 

G GC G GC G GC 


a/(^n 

2.025 

2.208 

1.540 

1.630 

1.545 

1.595 

ERP 

88 

81 

100 

100 

100 

100 

ARI 

0.980 

0.964 

1.000 

1.000 

1.000 

1.000 


In our numerical studies, we use the Frobenius loss L {R,M) = ||i2 — M\\p. 
In the Gaussian setting, R is the sample correlation matrix, in the Gaussian 
copula setting it is sinusoid transformed Kendall’s tau in (4). 

The theoretical investigation of this criterion is beyond the scope of this 
work, and we present below its performance on a simulated example. 

7.3.2. Numerical studies. We use the case n = 1000 and p = 1600 to study 
the ’’large n, large p” performance of our cross validation approach. The 
training dataset and validating dataset have the same sample size 500 respec¬ 
tively. Since the theoretical value of a is proportional to log^^^p, we 

use a grid of a/log^^^p) = 0.25,0.5,..., 5. Table 1 shows that the av¬ 
erage a selected by GV is between 1.5 x log^^^ p and 2.2 x log^^^ p, 
close to our fixed choice 2 x log^^^p in the previous simulation study. 

Using the cross validated a in each run, CORD recovers exactly the true 
groups over 80% of the runs for Model 1, and 100% for Model 2 and 3. 
Under Model 1, CORD still achieves high average ARI values, larger than 
0.964. The performance metrics under Model 1 decrease only slightly for the 
Gaussian Gopula setting compared with the Gaussian setting. 

We show the relationship between the average GV losses and exact recovery 
percentages in Figure 3. This shows that the optimal ranges of a values for 
high exact recovery percentages are also associated with low average GV 
losses. All these ranges are around 2n“^/^ log^^^ p in all models. 

7.4. The Cord metric for correct community placement of pairs of vari¬ 
ables. The Cord method can be viewed as a method for clustering the p 
rows (and columns) of R. Most clustering methods are applied relative to 
a given distance or metric, where the metric is chosen such that pairs of 
objects are grouped together if it is small, and placed in different groups if 
the metric is large. 

In this section we validate numerically that the Cord metric given by (5) can 
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Figure 3: Average CV losses (black dotted lines, plus points) and exact 
recovery percentages (red solid lines, circle points) across 100 runs. The 
minimal average CV losses are shown by gray vertical lines. 


Semi-parametric: Gaussian copula 

(a) Model 1 (b) Model 2 (c) Model 3 
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alpha/sqrt(log(p)/n) 


alpha/sqrt(log(p)/n) 


alpha/sqrt(log(p)/n) 


indeed be viewed as a bona fide variable clustering metric, when communities 
have nontrivial size, larger or equal to 2, and are generated from the G- 
models introduced and studied in the previous sections. The study presented 
below is therefore not of the Cord algorithm per se, but rather of its metric, 
which is the crucial ingredient of our method. To assess correct community 
assignment of pairs of variable, we compare the number of false discoveries 
to that of discoveries (we define below what each means in this context). 
Given the data, we pair two variables (a, b) if CORD(a, b) < v, where n is a 
given threshold value. We call such a pair a “discovery”. We recall that CORD 
is given by (5), with R an appropriate estimator of R. Of these discoveries, 
some are false, in that Xa and Xb do not belong to the same group in the 
true partition. Formally, we let 


(8) FD{v) = |{C^(a,6) < v, a b,a> b}\, 

D{v) = |{CORD(a, 6) <v,a> b}\. 

Both FD{v) and D[v) increase with v, and we want to study the effect of 
increasing v on the number of false discoveries FD relative to the number 
discoveries D. Figure 4 compares the average number of false discoveries 
FD as we vary the threshold v such that the percentage of discoveries D 
reaches up to 10% of the total number of entries in the lower triangular part 
of the correlation matrix. The results correspond to n = 500 and p = 1600, 
chosen as a representative case. This figure shows that the CORD metric 
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Figure 4; Average numbers of false discoveries (FD) and discoveries (D) 
after thresholding either the CORD (red solid lines) or correlation (blue 
dotted lines) metrics. The shaded areas show meaniSD. 


Semi-parametric: Gaussian copula 

(a) Model 1 (b) Model 2 (c) Model 3 
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yields almost zero false positives, for a relatively high range of the threshold 
values. ^ 

We also compared FD and D with Cord in (8) replaced by the estimated 
negative correlation, which is either the Pearson’s correlation in the Gaus¬ 
sian setting, or the negative Kendall’s tau in the Gaussian copula setting. We 
recall that the former is equivalent with using the £2 distance between ob¬ 
servations on two variables. These metrics are the ingredients of the variable 
clustering algorithms used for comparison in the previous two subsections. 
Figure 4 shows that for our G-latent models, the negative correlation (Pear¬ 
son’s or Kendall’s tau) metric can yield as many as 40 000 false positives 
out of a total of over 120 000 discoveries. This suggests strongly that the 
commonly used metrics, irrespective of the algorithm that employs them, 
are not optimal for G-exchangeable, G-block and G-latent models, whereas 
the Cord metric adapts to this structure. 

8. The analysis of Standard & Poor 100 Stocks. To illustrate the 
applicability of the G-exchangeable models for defining communities of vari¬ 
ables, we apply Cord for the analysis of a stock dataset, in order to study 
which stocks are grouped together. We compare the resulting partitions 
with those obtained from the K-means and HC algorithms, respectively. 
This dataset contains daily returns of the stocks listed in the Standard & 
Poor 100 index (as of March 21, 2014), from January 1, 2006 to December 
31, 2008. The stocks with incomplete data, due to company reorganization 
for example, are removed from the analysis, and this leaves 91 stocks with 
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returns from 755 trading days in total. 

We first compare the groups recovered by Cord, K-means, and HC. We 
use the cross validation approach of Section 7.3.1 to select a, and it selects 
the threshold a = log^^^p with K = 30. We thus set ii' = 30 in 

Kmeans and HC (using the negative Kendall’s tan distance) to compare the 
clustering performance. Table 2 lists the stocks in the first 5 CORD groups. 
Because the partitions from K-means and HC are different, we list all the 
groups from K-means and HC that contain these stocks for each CORD 
group. It is clear that CORD groups together directly competing companies 
that offer similar or almost identical services, while K-means and HC can 
group companies that provide different kinds of services and products. For 
example. Home Depot and Lowe’s are the only two companies in a CORD 
group, probably because they are both home improvement stores. However, 
Starbucks is added to this group by both K-means and HC, even though 
Starbucks is a coffee shop chain. Further comparison is provided in Section 
3 of the supplemental material [4]. 

An additional application of the CORD method to an fMRI data analysis 
is presented in Section 4 of the supplemental material [4]. 
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Industry 
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HG 
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Equipment 
and Service 

Anadarko 
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Devon Energy, 
Halliburton, 
National Oillwell 
Varco, Occidental 
Petroleum, 
Schlumberger 

Anadarko 
Petroleum, 
Devon Energy, 
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National Oillwell 
Varco, Occidental 
Petroleum, 
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Apache, 
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Anadarko 
Petroleum, 
Devon Energy, 
Halliburton, 
National Oillwell 
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Schlumberger, 
Apache, 
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National Oillwell 
Varco 

Freeport- 

McMoran 




AT&T, Verizon, 


2 

Telecom 

AT&T, Verizon 

Pfizer, Merck, 
Lilly, 

Bristol-Myers 

AT&T, Verizon 
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Railroads 

Norfolk Southern, 
Union Pacific 

Norfolk Southern, 
Union Pacific 

Norfolk 

Southern, Union 
Pacific, Du Pont, 
Dow Chemical, 
Monsanto 





Home Depot, 
Lowe’s, 

4 

Home Im¬ 
provement 

Home Depot, 
Lowe’s 

Home Depot, 
Lowe’s, Starbucks 

Starbucks, 
Costco, Target, 
Wal-Mart, 
FedEx, United 
Parcel Service 

5 

Air Freight 
& Logistics 

FedEx, United 
Parcel Service 

FedEx, United 
Parcel Service, 
Caterpillar, Du 
Pont, Dow 
Chemical, 
Emerson Electric, 
General Electric, 
Honeywell, 3M, 
Nike, United 
Technologies 

Home Depot, 
Lowe’s, 
Starbucks, 
Costco, Target, 
Wal-Mart, 
FedEx, United 
Parcel Service 
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APPENDIX A: PROOFS 

A.l. Identifiability in G-exchangeable models: Proof of Theorem 

1 . The first two statements follow by the definition of G-exchangeability 
and that of G A G'. 

We begin by proving the third claim. For this, we first show that any per¬ 
mutation a G SqaG' can be decomposed into transpositions a = Ti...Td with 
each transposition r* permuting two elements a*, bi belonging to the same 
group of G or G'. Below, we write Tab for the transposition between a and 
b. We need the following lemma. 

Lemma 3. Let ci,... ,Ck be k > 2 distinct points. Then the transposition 
TciCfc can be decomposed into Tc^c^. = ''"cj Cj.+i with N < 2k and 1 < 

ji<k - 1. 

Proof of the lemma. For fe = 2 it is obvious. For k = 3 we have Tcica = 
TC 2 C 3 TC 1 C 2 TC 2 C 3 ■ Assume that the property holds up to k elements. We have 
TciCfc+i = Tc^Ck+iTciCf,rc^Ck+i and applying the induction hypothesis to 
we obtain the property at level A: -|- 1. The proof of the lemma is complete. 
□ 

We first observe that a permutation a G <SgaG' can be decomposed into 
transpositions a = ti...T£, each transposition acting on a single class (G A 

^ 1 G or G' G or G' G or G' G or G' 

G jfc. Lor any a, 0 G (GAG jfc we have a ~ ci ~ ... ~ Cd ~ 

b so according to the lemma, the transposition Tab can be decomposed as a 

product of transpositions, where each transposition is between two elements 

which are in a same Gk or G'^. So, any permutation a G SqaG' can be 

decomposed into transpositions permuting elements belonging to a same 

group of G or G'. 

Let us conclude. If r is a transposition between two elements of a common 
cluster Gk or G'^, then X and X' = X^ have the same distribution by hy¬ 
pothesis. So, for any permutation a, the variables X^- and X'^ = Xto- have 
the same distribution. So, by induction, using the above decomposition, we 
obtain that for any permutation a G 5gaGA the variables X^ and X share 
the same distribution. Hence X ~ G f\G'. This concludes the proof of the 
third claim. 

We conclude the proof of this proposition by proving the fourth claim. The 
set £{X) is non-empty since the trivial partition G = {{1} ,..., {p}} be¬ 
longs to £{X). It is also a finite set, and we can enumerate it £{X) = 
{Gi,..., Gm}- Define the sequence G'^,..., G'j^ recursively according to 
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• G; = Gi, 

• G'fc = Gfe A G'fc_i for fe = 2,..., M. 

According to Proposition 1, we have by induction that , G'j^ G ^i^)- 

In addition, we have both < G'^_^ and G'^ < G^, so by induction < 
Gi,..., Gfc. Hence, the partition G'^(X) := G'j^ = Gi A G 2 A ... A Gm-i is 
the minimum of £{X). □ 

A.2. Minimax lower bounds: Proof of Theorem 2. Our proofs are 
based on a version of Fano’s lemma suited for our application. Specifically, 
we will employ Dirge’s Lemma (Corollary 2.18 in [22]), which we state below, 
translated to our problem. 

Lemma 4. For any partition estimator G, and for any collection of distinct 
G^^^-block covariance matrices G 1 < i < M, we have 

sup Pr{G^G^^°^^) > 

RGlZiri) 

( 9 ) > 

( 10 ) > 

where IC{R^^\ R^^^) denotes the Kulback-Leibler divergence between two Gaus¬ 
sian likelihoods based on n observations. 

Proof. Inequality (9) is Dirge’s Lemma (Corollary 2.18 in [22]), translated 
to our problem. We prove inequality (10) below. We only need to check that 

(11) -R^^^)f/2. 

We have 

= I (Trace{{R^^^)-^R^^^ -I)- logdet (^{R^^'>)-^ 

= |(F((i?«)-ii?(^))-F(/)). 

with F{S) = Tracers) — log det{S). Notice that F is convex, and therefore 
F{I + H)- F{I) <(/-(/ + 


.max^P^0)(G^G(l)) 


1 


2e + 1 
1 

2^ 1 


A 1 — max 


/C(i?(l),i?W)' 


A I 1 — max 
i>2 


i >2 log(M) J 


21 og(M) 
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since the gradient of our function F in I + H is I — {I + H) Let ai > 
1 T 2 > ... be the singular values of H. Then 

(/-(/ + H)-\H) = Y, = \\Hf- 

k k 

Consequently, (10) holds, and the proof of this lemma is complete. □ 

The task therefore is that of constructing M correlation matrices in lZ{r]) 
and determining rj* for which either (9) or (10) in Lemma 4 is at least 2 e+i • 
Whereas the proof of Theorem 2 would be completed by constructing one 
instance of these M matrices, we will provide three such constructions, in 
order to emphasize the generality of our results. We begin by constructing a 
sequence of matrices with K = 2 blocks, one of size 2 and one of size p — 2. 

Proposition 7. Let us fix 0<a<l3<l. We construct a sequence of 
correlation matrices in TZ{r]) as follows. To any pair a b in {1,... ,p}, we 
associate the matrix with 1 on the diagonal and a off the diagonal, 

except for positions {a,b) and {b,a) where we define = fi. The 

partition cbiock{a,b) associated to is then {{a, 6} ; {1,... ,p} \ {a, 6}}. 

When 

noN /log(p(p - l)/ 2 ) 

(2 + e-l)» ' 


then 

inf sup Pr(G > infmaxP^(.. 6 )(G . 

G R&n{ri) G 2e + 1 

Proof of Proposition 7. First, notice that the cardinality of the set 
: a b e {1,... ,p}} is M = p{p — l)/2, and all matrices G 

77(r/), for any p < fi — a. We denote by J the matrix with all entries equal to 
1, which is positive semi-definite. The matrix —aJ has 1 — a on the di¬ 
agonal, 0 off the diagonal, except in (1, 2) and (2,1) where it takes value f3—a. 
The lowest eigenvalue of — aJ is 1 —/3, hence |(P(^’^^)“^|op < (1-/3)“^. 

We also observe that = 4(/3 — a)^ for any (a, 6 ) 7 ^ (1,2). 

Hence 

||(77(1’2)^-1(^(1,2) _ ^(a,b))||| < 4 ■ 
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Then, when (12) holds, inequality (10) of Lemma 4 above gives the desired 
result: 


irif maxP^(.,„(a / A 

Q a^b Ze + 1 

1 

“ 2e + r 


1 - 


2n{(3 — a)^ 


(1 -/3)2log(p(p- l)/2) 


□ 


We now construct a second sequence of matrices with many small blocks. In 
this example, we have K = p/2 blocks of size 2, assuming that p is even. 

Proposition 8. Let be the matrix = aJ + diag{A, ■ ■ ■ ,A) where 
J is the matrix with all entries equal to 1 and 


(\ — a f3 — a\ 
\/d — a 1 — a J ' 


The partition associated to R^^^ is 2} ;...; {p — \,p} 

We denote by Tab the transposition between a and b in {1,... ,p} and by 
the matrix ^ ^ - 


( 1 ) 

TabiWabU) 




built from R^^'i by permuting the 


rows and columns indexed by a and b. The partition associated to 

this matrix is the same as except that the indices a and b have been 

switched. Then, when 


(13) 


/3-« ^ / log(p(p - 2)/4y 
1 — /3 “ Y (4 + 2e“^)n ’ 


no estimator can perfectly recover the partition uniformly on the set of 

distributions {P^(i)} U {P^(Q,i,) : a = 2k, k = 1,... ,p/2, b > a], with prob¬ 
ability larger than 2e/(2e + 1). 


Proof of Proposition 8. In this case, the cardinality of the set 
■.a = 2k, k = l,... ,p/2, 6 > a} U 

is M = p(p—2)/4+l, and R^h) g TZ(g) for any a and /? such that g < fi — a. 
As in the previous example, since the lowest eigenvalue of A is 1 — ;0, we have 


imsart-aos ver. 2014/10/16 file: C0RD-REVISI0N.Aprill.2016.tex date: April 1, 2016 










VARIABLE CLUSTERING WITH CORD 


33 


^|op<(l —/3) We also observe that = 8(/3 — a)^. 

Therefore, when (13) holds, Lemma 4 gives 


inf 

G 

> 


max 

ae{2,4,...,p}, h>a 


1 


2e + 1 


A 1- 


4n(/3 — a)^ \ 1 

(1 - /3)2log(p(p - 2)/4)y “ 2e + 1‘ 


□ 


By analogy with clustering via SBM, we may expect that when all clusters 
have a common size m, the clustering problem become easier and easier as 
m grows. The following example shows that it is not the case in general, 
even when m grows linearly with p. 


Proposition 9. Let 0 < e < 1 and define 


C{e) = 





whieh is positive semi-definite. Consider the following covariance matrix 
with K = 3 blocks of equal size m = p/3 given by S := AC{e)A^ + I, 
where A is a p x 3 hard assignment matrix given by Aji = 1, if 1 < j < m, 
and zero otherwise, Aj 2 = 1, if rn-\-l<j< 2 m, and zero otherwise and 
Aj 3 = 1, if 2m + 1 < j < 3m, and zero otherwise. For a G {l,...,m} 
and b G {m + 1,..., 2 m + 1}, we construct by permuting the rows and 

columns indexed by a and b. We collect the M = m? + 1 candidate matrices 
in a set A4. Let G he the partition associated with one of the generic 
matrices S' G A4. If 

/0.81og(p/3) 

e < V- and n < 0.8e, 

V n 

then 


inf sup PniG > inf max Ps'(G / 

G R£Tl{ri) G ^'&M 

1 

“ 2e + 1 ■ 


Proof of Proposition 9. We begin by noting that S given above is not 
a correlation matrix, but since DiagonaliYl) = (1 + e,1 + e, 3,..., 3), its as¬ 
sociated correlation matrix R will have the same block structure. Moreover, 
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since MCord(S) = 2e, then MCORD(i?) = ce, for some c G [2/\/6, 2/-v/3], 
and R G TZir]), for any i] < 0.8e. To avoid the notational clutter that would 
be introduced by normalizing S, we will work in what follows with covariance 
matrices, which does not affect the proof. 

We will once again appeal to Lemma 4, and we calculate below the KL- 
divergence S). By symmetry, we can assume that a = 1 and b = 

m + 1. We write henceforth Q for the permutation matrix associated to the 
transposition of 1 and m + 1 and S' = QTiQ^. We start by writing out the 
eigenvalues and the corresponding eigenvectors of C{t)-. 

• Ai = e (2 — e) with ^ 0 ] 

• A 2 = 2 + with ul = 1 2 /e] 

• Ag = 0 with = ^^[1 -1 e] 

Let = -^{Auk)'^^ for fe = 1, 2. Then, with I denoting the p x p identity 

matrix, we have 


S = ^ mAfcUfcUfc + I, S ^ = - X] 1 + I 


k=l 

2 


k=l 


1 + mXk 


S' - S = ^ mXk[QvkvlQ 


T Ti 

VkVk ] 


k=l 


WehaveQufc = with Af’ = ;^[(nfc) 2 -(ufc)i, 0...0, (ufc)i-(nfc) 2 , 0...0]. 

Then, Ai = 0, A^ = -^=|=^[1, 0...0,-1, 0...0], and 

S — S = niX2{v2^2 T A 2 U 2 + A 2 A 2 ). 

We notice that vivf A 2 = 0 and vivj'v 2 = 0, so putting pieces together: 
S"^S' = / + S"^(S' - S) = / + mX2M, 


where 


M := {I - "'"^2 _|_ A 2 V 2 + A 2 A 2 )• 

1 + mX2 

Writing s := A^U2, 7 := A|'A2 and p := 171X2/ (1 + ^1X2), we have 
M = [U 2 A 2 + A 2 V 2 + A 2 A 2 - p(v2A2 + SV 2 V 2 + SV 2 A 2 )] 

= V2AI{1 - p{l + s)) + A2 vJ + A2A^ - PSV2V2. 
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Let us compute the eigenvalues of M. We observe that the range of M is 
spanned by V 2 and A 2 , so we seek for eigenvectors u = V 2 + aA 2 . Since 


(14) 


26 ^ 

m (2 + e^) 


computing Men, we obtain 


2 e^ 
m\2 ’ 


and 7 


4e" 

mX2 


-2s, 


Mu = [(1 - p{l + s))(s + 07 ) - ps{l + as)]v 2 + [1 + as + s + a 7 ]A 2 
= [(1 - p{l + s))(l - 2 a) - p{l + as)]su 2 + [1 + (1 - a)s]A 2 , 


so 


Mu = p,u 


1 + (1 — a)s = ap 

[(1 - p{l + s))(l - 2 a) - p{l + as)]s = p 




1^0 — P^ p\ps{2 + s)] + (1 — p'){s + 2 )s. 
Therefore, the two non-zero eigenvalues pi and p 2 of M fulfill 


Pi+ P 2 = -ps{2 + s) 
P 1 P 2 = (1 - p)s{2 + s). 


Since S = I + mX2M, with M of rank 2 , we can now compute /C(S', S): 

S) = Tr(mA2M) - log det(/ + mAaM) 
n 

= mX2{pi + P2) - log(l + mX2{pi + P2) + {'mX2fpiP2) 

= —mX2ps{2 -I- s) — log (1 — mX2ps{2 -|- s) -|- (mA2)^(l — p)s {2 + s)) . 


We observe that mA 2 (l — p) = p, so the terms in the log cancel and finally, 
plugging the values in (14) into the above display, we obtain 

/C(S', S) = -^mX2ps{2 -L s) = 2npe^ (l -—< 2ne^. 

Since we have M > (p/3)^ candidate matrices. Lemma 4 ensures that when 
^ — Y^ ^( 2 e+f)^ probability of mis-clustering is larger than l/( 2 e -|- 1 ). 

□ 
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A. 3. Proof of consistent recovery. Proof of Proposition 6. First, we 
notice that, since — i?|oo = we have 

\Rac Rbc\ 2t ^ \Rac Rbc\ ^ l-^ac Rbc\ “1“ 2t, 

for any a, b, c. Hence CORD(a, b) — 2 t < Cord(o, b) < Cord(o, b) + 2r. We 

Qblock - 

then observe that a ~ b => CORD(a, 6) = 0 CORD(a, 6) < 2r, 

and when the group separation condition R £ TZip) holds 

a oo b Cord(o, 6)>?7 CORD(a, 6 ) > ry — 2r. 

In particular, under the condition (6) and the group separation condition 
R G TZii]), we have 

/~iblock --- 

(15) a ^ b <;=> CORD(a, 6 )<a. 

Let us prove the proposition by induction on 1. We consider the algorithm 
at some step I and assume that the algorithm was consistent up to this step, 

If IS I = 1, then it directly follows that G = . Assume now that |S| > 1. 

(i) If CORD(ap bi) > a, then according to (15) no 6 G S is in the same group 
as ai- Since the algorithm has been consistent up to step I, it means that a/ 
is a singleton and Gi := {ai} = 

—-— Qblock 

(ii) If CORD(a;, 6 /) < a, then ai ~ 6 / according to (15). The equiva¬ 
lence (15) furthermore ensures that Gi = Since the algorithm has 

been consistent up to this step we have C S and hence Gi = 

To conclude, the algorithm remains consistent at step I and the proposition 
follows by induction. □ 

A.4. Proofs for G-latent models. 

A.4.1. Proof of Proposition f. Let G^ be minimal in C{X). Since C{X) C 
B{X), then G^ is a sub-partition of := G^^°^^{X), that is G^ < G^. 
Write respectively for the index function associated to the partition 
G^, respectively G^. Similarly, we write G^ for the G-matrix associated to 
G^ according to Definition 6 and G^ for the one associated to G^. According 
to the very definition of G^ and G^, for every a 7^ 6, we have 

^kP(a)kP{b) ^ ^ ^kpa)kpb)- 
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Since X is not G^-latent, the matrix is not positive semi-definite: There 
exists ^ 0 such that < 0. We next show that it enforces 

to have at least one singleton. 

Let X be the size of and x be the size of . Since is 
a sub-partition of up to a relabelling of the groups in we have 

Gf = Gi U ... U Gk^ , • • ■, G^^ = Gki+...+k^i3_^+i U ... U Gk^^ ■ 

We can associate to the vector G the vector x^ G defined by 
x^ = (xf,0 ,...,0,xf,0, ...) 

where is located at index i{k) = Ki + ... + Kk-i + 1. For any k ^ j, 
taking a G G^^^^ and b G G\j), Equation (16) gives G^- = Assume 

now that G^ has no singleton: For any k G we can choose 

a / 6 in Gyi^y For this choice of a and b, the Equation (16) gives G^^ = 
So C^j = for all k,j G {1,...,A:^}. Hence, we obtain 

{x^Y'C^{x^) = {x^yG^x^ < 0. This is impossible since X is G^-latent, so 
G' is the covariance matrix of the latent variables and hence it is positive 
(semi-)definite. We conclude that G^ has at least one singleton. □ 
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Abstract The following document provides proofs of some of the 
lemmas, propositions and corollaries in [3]. It also contains additional 
simulation results and an additional data analysis. The numbering of 
lemmas, propositions and corollaries will be the same as in [3]. Results 
in [3] will be used without explicit reference. 


1. Supplemental proofs and theoretical results. 

1 .1. Proofs for Section 2. 

1.1.1. Proof of Lemma 1 of Section 2.3. The first two properties follow di¬ 

rectly from Definition 4. The third statement follows by a simple calculation, 
since 'Pgui the covariance matrix of {Xa)a^Gki has dk on the diagonal and 
Ckk outside of the diagonal, and hence dk — Ckk and dk + {\Gk\ — ^)ckk as eigen¬ 
values. Since is positive semi-definite, its eigenvalues are non-negative, 
which gives the last inequalities. □ 

1.1.2. Proof of Proposition 1 of Section 2.3. 1) We first show that := 

Qbiock^j^-j i^eiongs to B{X). By Definition 5, we have that var{Xa) = var{Xb) 

f^l3 /-ijS f-ijS 

for a ^ b. Also, for a ^ b, a' ^ b' with a b, a' b', and a a', b b', 

we have 

COv{Xa,Xa') = COv(Xb,Xa') = COv{Xb,Xb') 

if 6 7^ a'. If 6 = a' and a b' then a ~ a' = 6 ~ 6' in so cov{Xa, X^/) = 

cov{Xa, Xb) = cov{Xa, Xb') since b ^ b' and a / b,b'. If 6 = a' and a = b' 
the equality is obvious. Therefore, Definition 4 is met and we conclude that 
S is G^-block structured. 

1 
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2) Next, we show that < G, for any G G B{X). For this, we need to 

G 

show that whenever a ^ b, then a ^ b, which will establish that G is a 
snb-partition of G^. If a ~ 6 we have, by Definition 4: 

Q 

• var(Xa) = nar(Xft), since a ^ b. 

G G 

• cov{Xa, Xc) = cov{Xi„ Xc), for any c ^ a,b, since a ~ 6 and c ^ c. 

gp 

Therefore, a ^ b, which conclndes the proof. □ 

1.1.3. Proof of Lemma 2 of Section 2.5.1. Let Z be a Gaussian AA(0,C) 

random variable and Ea be independent random variables, independent of 
Z and with M{Q,Tiaa — G^a)k{a)) distribntion. Setting X'^ = + Ea 

we observe that X' follows a A^(0, S) Gaussian distribution, so X = X' in 
distribution. □ 

1.1.4. Proof of Proposition 5 of Section 2.5.2. (i) We only need to prove 

<;=. If X has a G-block covariance, then for any a G Sq we have cov{Xa) = 
cov(X) so the distribution of X^ is AA(0,S). (ii) The second equivalence 
follows from Proposition 3 and Lemma 2. (iii) The first part follows imme¬ 
diately from above. For the second part, let G^{X) be a minimal element of 
C{X). Since C{X) C B{X), then G^{X) < G\X). Since X is G^(X)-latent, 
by Proposition 4, and G^{X) is minimal, we also have G^{X) < G^^°'^^{X). 
Therefore G^{X) = G^^“^(X) and the proof is complete. □ 

1.2. Proof of Corollary 1 of Section 6. Hoeffding concentration inequality 
for U-statistics [12] gives 

P {\^ab — Tab\ > t) < for all o < 6 and t > 0 . 

If R is given by (5), and since x —)> sin( 7 rx/ 2 ) is ( 7 r/ 2 )-Lipschitz, we obtain 
for all t > 0 

P (^[.R - R\oc > t) < 

Taking t = 27r-\y(1 -|- A) log(p)/n, the result then follows from Proposition 
6, since r < (1 -|- with probability higher than 1 — □ 

1.3. A corollary for Gaussian distributions . 

Corollary 1. Let us consider A > 0 and {a^rf) such that 

1(1 +A) log(p-L 1 ) ^ , ir- /(I + ^) log(p + 1 ) 

a>16\ - and rj > a + 16\ -. 

V n V n 
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IfX has a zero mean Gaussian distribution with correlation matrix R G 
then the output of our algorithm applied to the sample correlation 
matrix R is consistent: G = with probability higher than 1 — p~‘^^. 

We remind the reader that the scaled variable Y = scalefX.) is defined by 
Ya = Xaj for o = 1,...,p. We Write S for the covariance matrix 

of Y, which coincides with the correlation matrix of X, i.e. S = R. We also 
write S for the (unobserved) empirical covariance matrix of Y. The following 
lemma connects the sup-norm of i? — i? to the sup-norm of S' — S'. 

Lemma 1. We have — i?|oo < 2|S — S|oo- 

Proof of the lemma. Since the empirical correlation matrices of X and Y are 
equal, we have Rab = Sab/{SaaSbbY^'^ ■ Since R = S, the triangle inequality 
gives 

\Rab - Rab\ = \Rab{l " {SaaSbb)^^^) + Sab " Sab\ 

< \Rab\\l-{SaaSbb)^^^\ + \Sab-Sab\. 


We notice that 

|1 - (SaaSbb)'/'! < |1 - Saa\ V |1 - Sbb\ = \Saa - Saa\ V \Sbb - Sbb\. 
Since \Rab\ < 1, we conclude that for any a, 6 G {1,... ,p} 

\Rab - Rab\ < \Rab\ (^|Saa “ Saa\ V \Sbb “ Sbb\^ + |Safe - Saf,] 
<2|S-S|oo. 


The proof of Lemma 1 is complete. □ 

When 16(1 -|- ^4) log(p -|- 1) < n, the classical bound on Gaussian covariance 
matrices (see e.g. [11], p.l59) 

P(|S — Sjoo > t) < p{p -|- l)e“"'* for any 0 < t < 1, 

ensures that |S — Sjoo < 4y^(1 -|- A) log(p -|- l)/n with probability at least 
1 — p~‘^^. Hence, according to Lemma 1, we have 

(1) \R - .Rjoo < 8a/(1 -L A) log(p -L l)/n 

with probability at least 1—Since |i? —< 2, the bound (1) remains 
valid when 16(1 -|- ^)log(p -|- 1) > n. The conclusion of Corollary 3 then 
follows from Proposition 6. □ 
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2 . Supplemental numerical results for simulated Gaussian data. 

We collect here the numerical results obtained in the Gaussian case. They 
are very close to those obtained for simulated Gaussian copula data. Figure 1 
shows the percentage of exact recovery, the ARI index, and the performances 
of GV. Figure 2 displays the average number of False Discovery. 

3. Supplemental results for the Standard & Poor 100 Stocks ex¬ 
ample. To illustrate the intuition behind the Cord metric, we compare 
the estimated correlation matrices corresponding to stocks, after reorder¬ 
ing the variables representing the stocks by the group labels recovered by 
the three methods under evaluation. Because the group labels are arbitrary, 
to make the comparison as fair as possible, we use an approximate match¬ 
ing method [15] to match the labels of the other methods to each one of 
Cord. Figure 3 plots the estimated Kendall’s tau correlation matrices after 
re-ordering by the matched labels. Visually, it is clear to see that the CORD 
partition shows mosaic block patterns for both the diagonal and off-diagonal 
parts of the correlation matrix. The other methods with criteria based on 
intra-class distances mainly show diagonal blocks, and there could be mul¬ 
tiple bands within each group because the variables in groups obtained by 
either K-means or HG can have different correlations with all other vari¬ 
ables, whereas for CORD, variables in a group have the same, high or low, 
correlation with all other variables. 

4. A functional MRI example. Using functional MRI data, [23] found 
that the human brain regions are organized into communities, sometimes 
referred to as networks. We use a publicly available fMRI dataset to illustrate 
the communities recovered by CORD. The dataset was originally published 
in [26] and is available from Open fMRI (https://openfmri.org/data-sets) 
under the accession number ds000007. We will focus on analyzing two scan 
sessions from subject 1 under a visual-motor stop/go task (task 1). Before 
performing the analysis, we follow the preprocessing steps suggested by [26], 
see Section 4.1 below for details. 

Using the first run data alone, the same CV procedure yields K = 80 with 
a = log^'^^p, close to the fixed constant 2 in our simulations. We 

thus set iA = 80 in K-means and HC, and use the same procedure to match 
the groups across the methods. The correlation matrices after reordering the 
variables are plotted in Figure 4. By visual comparison, the CORD groups 
are aligned with the bands in both rows and columns of the correlation 
matrix, while the other methods focusing on recovering diagonal blocks can 
have multiple bands in the same group. 


imsart-aos ver. 2014/10/16 file: Supplementtosubmission.Aprill.2016.tex date: April 1 


2016 


SUPPLEMENT TO: VARIABLE CLUSTERING WITH CORD 


5 


Figure 1: Top: Percentages of exact recovery by K-means (medium blue 
lines, triangle points), HC (light yellow lines, cross points) and CORD (dark 
red lines, circle points) across 100 runs, when p = 200 (solid lines) and 
p = 1600 (dashed lines). The minimal sample sizes are shown by light red 
vertical lines. All standard errors are smaller than 5%. Center: Average 
ARI by K-means (medium blue lines, triangle points), HC (light yellow 
lines, cross points) and CORD (dark red lines, circle points) across 100 runs. 
Bottom: Average CV losses (black dotted lines, plus points) and exact 
recovery percentages (red solid lines, circle points) across 100 runs. The 
minimal average CV losses are shown by gray vertical lines. 


(a) Model 1 


Exact recovery: Gaussian case 

(b) Model 2 


(c) Model 3 




(d) Model 1 


Partial recovery: Gaussian case 

(e) Model 2 


(f) Model 3 





(g) Model 1 


Cross-validation: Gaussian case 

(h) Model 2 


(i) Model 3 



alpha/sqrt(log(p)/n) 



alpha/sqrt(log(p)/n) 



alpha/sqrt{log(p)/n) 
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Figure 2: Average numbers of false discoveries (FD) and discoveries (D) 
after thresholding either the CORD (red solid lines) or correlation (blue 
dotted lines) metrics. The shaded areas show meaniSD. 


Parametric: Gaussian 

(a) Model 1 (b) Model 2 (c) Model 3 
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Figure 3: Comparison of correlation matrices (black=l) when the variables 
are ordered by the groups recovered by (a) CORD, (b) Kmeans, and (c) HC. 
The variable groups are denoted by color sidebars. 


(a) Cord 



(b) K-means (c) HC 
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Figure 4; Heatmaps (black=l) of correlation matrices after re-ordering the 
variables according to the recovered groups by CORD, K-means, and HC. 
The order of groups from K-means and HC are approximately matched to 
that of Cord, and each group is denoted by color using the side bars. 


(a) Cord 


(b) K-means 


(c) HC 





The largest Cord group contain 35 ROIs. We use the brain function clas¬ 
sification map available at http://www.brainnexus.com/resources/resting- 
state-fmri-templates to find the functioning of these ROIs. This CORD group 
includes 14 visual areas, 14 motor areas, 4 salience network areas, and 3 ex¬ 
ecutive areas. The majority of the ROIs in this cluster correspond to visual- 
motor functioning, which is well expected to coordinate together during the 
visual-motor task. The salience and executive areas are also expected to be 
involved during the task, see a review [25]. We refer to Section 4.1 below 
for additional results. 

Because there are no gold standards for partitioning the brain, we use a 
prediction criterion to compare CORD with K-means and HC. We compute 
the correlation matrices Ri and R 2 from the first and second session data 
respectively. For a grouping estimate G, We use the following loss to evaluate 
its performance 


( 2 ) 


R2-r{Ri,G 


For a fair comparison, we show in Figure 5 the losses against different group 
sizes, where the groups are estimated by CORD, Kmeans, and HC respec¬ 
tively. Here, we use Kendal’s r for computing the correlation matrices, and 
the results for Pearson’s correlations are similar. Again, Cord outperforms 
all other methods for almost all group sizes. Compared with K-means and 
HC, Cord yields the smallest loss values for a wide range of K, and the CV 
selected K = 80 yields the smallest loss. 
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Figure 5: Comparison of CORD, K-means, and HC using two session data, 
using the Frobenius prediction loss criterion (2) where the groups are esti¬ 
mated by these methods respectively. The group sizes for the fixed and CV 
choices are shown as solid and dashed vertical lines respectively. 



4.1. Materials on the fMRI data analysis. 

Preprocessing. We applied the preprocessing steps suggested by [26] , which 
includes slice timing correction, alignment, registration, normalization to 
the average 152 T1 MNI template, smoothing with a 5mm full-width-half- 
maximum Gaussian kernel, denoising using the FSL MELODIC procedure, 
and a high pass filter with a 66s cut-off. The event-related activation and 
temporal correlation were removed using general linear models (GLM) for 
each voxel [9]. Following [23], we extract 180 mean activities within a 10mm 
spheres centered around each of 264 putative functional areas (see Table S2 
from [23]). 

Additional results. We plot in Figure 6 the brain ROIs and their CORD 
clustering memberships. 

The MNI coordinates (in mm) of the ROIs in the largest CORD are included 
in Table 1. 
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Figure 6: Brain ROIs and their CORD groups (shown by color nodes) under 
the (a) sagittal, (b) axial, and (c) coronal views. 


(a) Sagittal 


(b) Axial 


(c) Coronal 





Table 1 

MNI coordinates (x, y, z, in mm) of the largest CORD group and their functioning 

classification. 
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motor 
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motor 
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